Abstract


We formulate a local form of the bipartite ranking problem where the goal is to focus on the best
instances. We propose a methodology based on the construction of real-valued scoring functions.
We study empirical risk minimization of dedicated statistics which involve empirical quantiles of
the scores. We first state the problem of finding the best instances, which can be cast as a
classification problem with mass constraint. Next, we develop special performance measures for the
local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe
the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the
best instances cannot be achieved in a stage-wise manner where first, the best instances would be
tentatively identified and then a standard AUC criterion could be applied. Finally, we state
preliminary statistical results for the local ranking problem.

Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates 


1. Introduction 


The first takes all the glory, the second takes nothing. In applications where ranking is at stake,
people often focus on the best instances. When scanning the results from a query on a search
engine, we rarely go beyond the first one or two pages on the screen. In the different context of credit
risk screening, credit establishments elaborate scoring rules as reliability indicators and their main
concern is to identify risky prospects especially among the top scores. In medical diagnosis, test
scores indicate the odds for a patient to be healthy given a series of measurements (age, blood
pressure, ...). There again, particular attention is given to the "best" instances so as not to miss a possible
diseased patient among the highest scores. These various situations can be formulated in the setup
of bipartite ranking where one observes i.i.d. copies of a random pair (X,Y) with X being an
observation vector describing the instance (web page, debtor, patient) and Y a binary label assigning it
to one population or the other (relevant vs. non relevant, good vs. bad, healthy vs. diseased). In
this problem, the goal is to rank the instances instead of simply classifying them. There is a
growing literature on the ranking problem in the field of Machine Learning but most of it considers the
Area under the ROC Curve (also known as the AUC) criterion as a measure of performance of the
ranking rule (Cortes and Mohri, 2004; Freund et al., 2003; Rudin et al., 2005; Agarwal et al., 2005).
In a previous work, we have mentioned that the bipartite ranking problem under the AUC
criterion could be interpreted as a classification problem with pairs of observations (Clémençon et al.,
2005). But the limitation of this approach is that it weights uniformly the pairs of items which are badly
ranked. Therefore it does not make it possible to distinguish between ranking rules making the same number
of mistakes but in very different parts of the ROC curve. The AUC is indeed a global criterion
which does not allow one to concentrate on the "best" instances. Special performance measures, such
as the Discounted Cumulative Gain (DCG) criterion, have been introduced by practitioners in order
to weight instances according to their rank (Järvelin and Kekäläinen, 2000) but providing theory
for such criteria and developing empirical risk minimization strategies is still a very open issue.
Recent works by Rudin (2006), Cossock and Zhang (2006), and Li et al. (2007) reveal that there are
several possibilities when designing ranking algorithms with focus on the top-rated instances. In
the present paper, we focus on statistical rather than algorithmic aspects. We extend the results of
our previous work in Clémençon et al. (2005) and set theoretical grounds for the problem of local
ranking. The methodology we propose is based on the selection of a real-valued scoring function for
which we formulate appropriate performance measures generalizing the AUC criterion. We point
out that ranking the best instances is an involved task as it is a two-fold problem: (i) find the best
instances and (ii) provide a good ranking on these instances. The fact that these two goals cannot
be considered independently will be highlighted in the paper. Despite this observation, we will first
formulate the issue of finding the best instances which is to be understood as a toy problem for our
purpose. This problem corresponds to a binary classification problem with a mass constraint (where
the proportion $u_0$ of +1 labels predicted by the classifiers is fixed) and it might be of interest
per se. The main complication here has to do with the necessity of performing quantile estimation
which affects the performance of statistical procedures. Our proof technique was inspired by the
former work of Koul (2002) in the context of R-estimation where similar statistics, known as linear
signed rank statistics, arise. By exploiting the structure of such statistics, we are able to establish
noise conditions in a similar way as in Clémençon et al. (To appear) where we had to deal with
performance criteria based on U-statistics. Under such conditions, we prove that rates of convergence
up to $n^{-2/3}$ can be guaranteed for the empirical risk minimizer in the classification problem with
mass constraint. Another contribution of the paper lies in our study of the optimality issue for the
local ranking problem. We discuss how focusing on best instances affects the ROC curve and the
AUC criterion. We propose a family of possible performance measures for the problem of ranking
the best instances. In particular, we show that widespread ideas in the biostatistics literature about
the partial AUC (see Dodd and Pepe, 2003) turn out to be questionable with respect to optimality
considerations. We also point out that the empirical risks for local ranking are closely related to
generalized Wilcoxon statistics.


The rest of the paper is organized as follows. We first state the problem of finding the best 
instances and study the performance of empirical risk minimization in this setup (Section 2). We 
also explore the conditions on the distribution in order to recover fast rates of convergence. In 
Section 3 we formulate performance measures for local ranking and provide extensions of the AUC 
criterion. Finally (Section 4), we state some preliminary statistical results on empirical risk
minimization of these new criteria. 
2. Finding the Best Instances


In the present section, we have a limited goal which is only to determine the best instances without
bothering with their order in the list. By considering this subproblem, we will identify the main
technical issues involved in the sequel. It also allows us to introduce the main notations of the paper.

Just as in standard binary classification, we consider the pair of random variables $(X,Y)$ where $X$
is an observation vector in a measurable space $\mathcal{X}$ and $Y$ is a binary label in $\{-1,+1\}$. The distribution
of $(X,Y)$ can be described by the pair $(\mu,\eta)$ where $\mu$ is the marginal distribution of $X$ and $\eta$ is the
a posteriori distribution defined by $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$, $\forall x \in \mathcal{X}$. We define the rate of best
instances as the proportion of best instances to be considered and denote it by $u_0 \in (0,1)$. We
denote by $Q(\eta, 1-u_0)$ the $(1-u_0)$-quantile of the random variable $\eta(X)$. Then the set of best
instances at rate $u_0$ is given by:








\[ C^*_{u_0} = \{ x \in \mathcal{X} \mid \eta(x) > Q(\eta, 1-u_0) \} . \]

We mention two trivial properties of the set $C^*_{u_0}$ which will be important in the sequel:

• MASS CONSTRAINT: we have $\mu(C^*_{u_0}) = \mathbb{P}\{X \in C^*_{u_0}\} = u_0$,

• INVARIANCE PROPERTY: as a functional of $\eta$, the set $C^*_{u_0}$ is invariant to strictly increasing
transforms of $\eta$.
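
The two properties can be illustrated on a toy example. The sketch below assumes a finite space of ten equally likely points with made-up values of the posterior; the helper names `quantile` and `best_instances` are ours and not part of the paper.

```python
import math

def quantile(values, v):
    """Generalized inverse of the cdf of the uniform distribution on `values`:
    the smallest t such that the fraction of values <= t is at least v."""
    ordered = sorted(values)
    k = max(1, math.ceil(v * len(ordered) - 1e-12))  # tolerance guards float round-off
    return ordered[k - 1]

def best_instances(eta_values, u0):
    """Indices whose eta value lies strictly above the (1 - u0)-quantile."""
    q = quantile(eta_values, 1.0 - u0)
    return [i for i, e in enumerate(eta_values) if e > q]

# Hypothetical values of eta on ten equally likely points.
eta = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90, 0.95]
top = best_instances(eta, u0=0.3)
mass = len(top) / len(eta)  # mass constraint: fraction of selected points

# Invariance: a strictly increasing transform of eta selects the same points.
top_after_transform = best_instances([math.exp(5 * e) for e in eta], u0=0.3)
```

On this example the selected set carries mass exactly 0.3, and replacing the posterior by a strictly increasing transform of it leaves the selection unchanged.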


The problem of finding a proportion $u_0$ of the best instances boils down to the estimation of
the unknown set $C^*_{u_0}$ on the basis of empirical data. Before turning to the statistical analysis of the
problem, we first relate it to binary classification.


2.1 A Classification Problem with a Mass Constraint 


A classifier is a measurable function $g : \mathcal{X} \to \{-1,+1\}$ and its performance is measured by the
classification error $L(g) = \mathbb{P}\{Y \neq g(X)\}$. Let $u_0 \in (0,1)$ be fixed. Denote by $g^*_{u_0} = 2\mathbb{I}_{C^*_{u_0}} - 1$ the
classifier predicting +1 on the set of best instances $C^*_{u_0}$ and -1 on its complement. The next
proposition shows that $g^*_{u_0}$ is an optimal element for the problem of minimization of $L(g)$ over the family
of classifiers $g$ satisfying the mass constraint $\mathbb{P}\{g(X) = 1\} = u_0$.


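Proposition 1 For any classifier $g : \mathcal{X} \to \{-1,+1\}$ such that $g(x) = 2\mathbb{I}_C(x) - 1$ for some subset $C$
of $\mathcal{X}$ and $\mu(C) = \mathbb{P}\{g(X) = 1\} = u_0$, we have

\[ L^*_{u_0} := L(g^*_{u_0}) \le L(g) . \]

Furthermore, we have

\[ L^*_{u_0} = 1 - Q(\eta, 1-u_0) + (1-u_0)\,(2Q(\eta, 1-u_0) - 1) - \mathbb{E}\left(|\eta(X) - Q(\eta, 1-u_0)|\right), \]

and

\[ L(g) - L(g^*_{u_0}) = 2\,\mathbb{E}\left(|\eta(X) - Q(\eta, 1-u_0)|\; \mathbb{I}_{C^*_{u_0} \Delta C}(X)\right), \]

where $\Delta$ denotes the symmetric difference operation between two subsets of $\mathcal{X}$.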
PROOF. For simplicity, we temporarily change the notation and set $q = Q(\eta, 1-u_0)$. Then, for any
classifier $g$ satisfying the constraint $\mathbb{P}\{g(X) = 1\} = u_0$, we have

\[ L(g) = \mathbb{E}\left( (\eta(X) - q)\,(\mathbb{I}\{g(X) = -1\} - \mathbb{I}\{g(X) = +1\}) \right) + (1-u_0)\,q + (1-q)\,u_0 . \]

The statements of the proposition immediately follow. □
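
As a sanity check of Proposition 1, the statements can be verified exhaustively on a small discrete example. The sketch below assumes six equally likely points with made-up posterior values and enumerates every set of mass $u_0 = 2/6$; all names are illustrative.

```python
from itertools import combinations

eta = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]  # hypothetical eta(x) on six equally likely points
m, k = len(eta), 2                           # mass constraint u0 = k / m
u0 = k / m
q = sorted(eta)[m - k - 1]                   # (1 - u0)-quantile of eta(X)
best = {i for i in range(m) if eta[i] > q}   # the set of best instances at rate u0

def risk(C):
    """L(g) for g = 2*1_C - 1: errors are Y = +1 outside C and Y = -1 inside C."""
    return sum((1 - eta[i]) if i in C else eta[i] for i in range(m)) / m

def excess_formula(C):
    """2 E(|eta(X) - q| 1{X in symmetric difference of the best set and C})."""
    return 2 * sum(abs(eta[i] - q) for i in best ^ set(C)) / m

l_star = risk(best)
# Closed-form expression of the optimal risk stated in the proposition.
closed_form = 1 - q + (1 - u0) * (2 * q - 1) - sum(abs(e - q) for e in eta) / m
# Gap between the excess risk and its symmetric-difference expression, for every admissible C.
gaps = [abs(risk(set(C)) - l_star - excess_formula(C))
        for C in combinations(range(m), k)]
```

Every entry of `gaps` is zero up to floating-point error, and no admissible set has a smaller risk than the set of best instances.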


There have been several advances in the field of classification theory where the aim is to introduce
constraints in the classification procedure or to adapt it to other problems. We relate our formulation
to other approaches in the following remarks.


Remark 2 (CONNECTION TO HYPOTHESIS TESTING) The implicit asymmetry in the problem due
to the emphasis on the best instances is reminiscent of the statistical theory of hypothesis testing.
We can formulate a test of simple hypotheses by taking the null assumption to be $H_0 : Y = -1$
and the alternative assumption to be $H_1 : Y = +1$. We want to decide which hypothesis is true
given the observation $X$. Each classifier $g$ provides a test statistic $g(X)$. The performance of
the test is then described by its type I error $\alpha(g) = \mathbb{P}\{g(X) = 1 \mid Y = -1\}$ and its power $\beta(g) =
\mathbb{P}\{g(X) = 1 \mid Y = +1\}$. We point out that if the classifier $g$ satisfies a mass constraint, then we can
relate the classification error with the type I error of the test defined by $g$ through the relation:

\[ L(g) = 2(1-p)\,\alpha(g) + p - u_0 \]

where $p = \mathbb{P}\{Y = 1\}$, and similarly, we have: $L(g) = 2p\,(1-\beta(g)) - p + u_0$. Therefore, the optimal
classifier minimizes the type I error (maximizes the power) among all classifiers with the same mass
constraint. In some applications, it is more relevant to fix a constraint on the probability of a false
alarm (type I error) and maximize the power. This question is explored in a recent paper by Scott
(2005) (see also Scott and Nowak, 2005).
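
The two relations between the classification error, the type I error and the power follow from $u_0 = (1-p)\alpha(g) + p\beta(g)$ and $L(g) = (1-p)\alpha(g) + p(1-\beta(g))$. They can be checked numerically; the sketch below uses a made-up discrete distribution and an arbitrary set $C$, and verifies $L(g) = 2(1-p)\alpha(g) + p - u_0$ and $L(g) = 2p(1-\beta(g)) - p + u_0$.

```python
eta = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]  # hypothetical eta(x) on six equally likely points
m = len(eta)
C = {1, 4}                                   # arbitrary set where g predicts +1
u0 = len(C) / m                              # mass of C
p = sum(eta) / m                             # P{Y = +1}

alpha = sum(1 - eta[i] for i in C) / m / (1 - p)  # type I error P{g(X) = 1 | Y = -1}
beta = sum(eta[i] for i in C) / m / p             # power P{g(X) = 1 | Y = +1}
L = sum((1 - eta[i]) if i in C else eta[i] for i in range(m)) / m  # P{Y != g(X)}

via_alpha = 2 * (1 - p) * alpha + p - u0
via_beta = 2 * p * (1 - beta) - p + u0
```

Both `via_alpha` and `via_beta` coincide with `L` up to rounding, whatever the choice of `C`.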


Remark 3 (CONNECTION WITH REGRESSION LEVEL SET ESTIMATION) We mention that the
estimation of the level sets of the regression function has been studied in the statistics literature
(Cavalier, 1997) (see also Tsybakov, 1997 and Willett and Nowak, 2006) as well as in the learning
literature, for instance in the context of anomaly detection (Steinwart et al., 2005; Scott and
Davenport, 2006, to appear; Vert and Vert, 2006). In our framework of classification with mass constraint,
the threshold defining the level set involves the quantile of the random variable $\eta(X)$.


Remark 4 (CONNECTION WITH THE MINIMUM VOLUME SET APPROACH) Although the point of
view adopted in this paper is very different, the problem described above may be formulated in the
framework of minimum volume sets learning as considered in Scott and Nowak (2006). As a matter
of fact, the set $C^*_{u_0}$ may be viewed as the solution of the constrained optimization problem:

\[ \min_{C}\ \mathbb{P}\{X \in C \mid Y = -1\} \]
over the class of measurable sets $C$, subject to
\[ \mathbb{P}\{X \in C\} \ge u_0 . \]

The main difference in our case comes from the fact that the constraint on the volume set has to be
estimated using the data while in Scott and Nowak (2006) it is computed from a known reference
measure. We believe that learning methods for minimum volume set estimation may be
extended to our setting. A natural way to do it would consist in replacing the conditional distribution
of $X$ given $Y = -1$ by its empirical counterpart. This is beyond the scope of the present paper but
will be the subject of future investigation.


2.2 Empirical Risk Minimization 


We now investigate the estimation of the set $C^*_{u_0}$ of best instances at rate $u_0$ based on training data.
Suppose that we are given $n$ i.i.d. copies $(X_1,Y_1), \ldots, (X_n,Y_n)$ of the pair $(X,Y)$. Since we have the
ranking problem in mind, our methodology will consist in building the candidate sets from a class
$\mathcal{S}$ of real-valued scoring functions $s : \mathcal{X} \to \mathbb{R}$. Indeed, we consider sets of the form

\[ C_s = C_{s,u_0} = \{ x \in \mathcal{X} \mid s(x) > Q(s, 1-u_0) \} , \]

where $s$ is an element of $\mathcal{S}$ and $Q(s, 1-u_0)$ is the $(1-u_0)$-quantile of the random variable $s(X)$.
Note that such sets satisfy the same properties as $C^*_{u_0}$ with respect to mass constraint and invariance
to strictly increasing transforms of $s$.

From now on, we will use the simplified notation:

\[ L(s) = L(s,u_0) = L(C_s) = \mathbb{P}\{Y \cdot (s(X) - Q(s, 1-u_0)) < 0\} . \]

A scoring function minimizing the quantity

\[ L_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - Q(s, 1-u_0)) < 0\} \]

is expected to approximately minimize the true error $L(s)$, but the quantile depends on the unknown
distribution of $X$. In practice, one has to replace $Q(s, 1-u_0)$ by its empirical counterpart $\hat{Q}(s, 1-u_0)$
which corresponds to the empirical quantile. We will thus consider, instead of $L_n(s)$, the empirical
error:

\[ \hat{L}_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - \hat{Q}(s, 1-u_0)) < 0\} . \]

Note that $\hat{L}_n(s)$ is a complicated statistic since the empirical quantile involves all the instances
$X_1, \ldots, X_n$. We also mention that $\hat{L}_n(s)$ is a biased estimate of the classification error $L(s)$ of the
classifier $g_s(x) = 2\mathbb{I}\{s(x) > Q(s, 1-u_0)\} - 1$.
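
The difference between the criterion based on the true quantile and the one based on the empirical quantile can be seen on a small sample. The sketch below is illustrative only: the data are made up, and we pretend that $s(X)$ is uniform on $(0,1)$ so that the true quantile is known to be $Q(s,v) = v$.

```python
import math

def risk_with_threshold(scores, labels, threshold):
    """Fraction of indices i with Y_i * (s(X_i) - threshold) < 0."""
    errors = sum(1 for s, y in zip(scores, labels) if y * (s - threshold) < 0)
    return errors / len(scores)

def empirical_quantile(scores, v):
    """Generalized inverse of the empirical cdf at level v."""
    ordered = sorted(scores)
    k = max(1, math.ceil(v * len(ordered) - 1e-12))  # tolerance guards float round-off
    return ordered[k - 1]

# Made-up sample; s(X) is assumed uniform on (0, 1), hence Q(s, 0.7) = 0.7.
scores = [0.05, 0.15, 0.35, 0.45, 0.55, 0.58, 0.62, 0.68, 0.88, 0.97]
labels = [-1, -1, 1, -1, -1, 1, -1, -1, 1, 1]
u0 = 0.3

L_n = risk_with_threshold(scores, labels, 1.0 - u0)   # uses the true quantile
q_hat = empirical_quantile(scores, 1.0 - u0)
L_n_hat = risk_with_threshold(scores, labels, q_hat)  # uses the empirical quantile
n_selected = sum(1 for s in scores if s > q_hat)      # instances predicted +1
```

Here the empirical quantile selects exactly $n u_0 = 3$ instances, whereas the true quantile selects only two of the ten sample points, and the two criteria take different values (0.2 and 0.3).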

We introduce some more notations. Set, for all $t \in \mathbb{R}$:

• $F_s(t) = \mathbb{P}\{s(X) \le t\}$

• $G_s(t) = \mathbb{P}\{s(X) \le t \mid Y = +1\}$

• $H_s(t) = \mathbb{P}\{s(X) \le t \mid Y = -1\}$.

The functions $F_s$ (respectively $G_s$, $H_s$) denote the cumulative distribution function (cdf) of $s(X)$
(respectively, given $Y = 1$, given $Y = -1$). We recall that the definition of the quantiles of (the
distribution of) a random variable involves the notion of generalized inverse $F^{-1}$ of a function $F$:

\[ F^{-1}(z) = \inf\{ t \in \mathbb{R} \mid F(t) \ge z \} . \]

Thus, we have, for all $v \in (0,1)$:

\[ Q(s,v) = F_s^{-1}(v) \quad \text{and} \quad \hat{Q}(s,v) = \hat{F}_s^{-1}(v) , \]

where $\hat{F}_s$ is the empirical cdf of $s(X)$: $\hat{F}_s(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{s(X_i) \le t\}$, $\forall t \in \mathbb{R}$.

Without loss of generality, we will assume that all scoring functions in $\mathcal{S}$ take their values in
$(0,\lambda)$ for some $\lambda > 0$. We now turn to study the performance of minimizers of $\hat{L}_n(s)$ over a class $\mathcal{S}$
of scoring functions defined by

\[ \hat{s}_n = \arg\min_{s \in \mathcal{S}} \hat{L}_n(s) . \]
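
For a finite class, the empirical risk minimizer can be computed by direct enumeration. The following sketch is a minimal illustration under our own assumptions: the data, the three candidate scoring functions and the function names are hypothetical, and integer-valued scores are used only to keep the arithmetic exact.

```python
import math

def empirical_risk(score, xs, ys, u0):
    """Empirical error with the empirical (1 - u0)-quantile of the scores."""
    values = [score(x) for x in xs]
    k = max(1, math.ceil((1.0 - u0) * len(values) - 1e-12))
    q_hat = sorted(values)[k - 1]  # generalized inverse of the empirical cdf
    errors = sum(1 for s, y in zip(values, ys) if y * (s - q_hat) < 0)
    return errors / len(values)

def erm(candidates, xs, ys, u0):
    """Name of the candidate scoring function with the smallest empirical error."""
    return min(candidates, key=lambda name: empirical_risk(candidates[name], xs, ys, u0))

# Made-up sample: labels are driven by the first coordinate.
xs = [(1, 9), (2, 4), (3, 8), (5, 1), (6, 7), (7, 3), (8, 6), (9, 2)]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
candidates = {
    "first": lambda x: x[0],
    "second": lambda x: x[1],
    "minus_first": lambda x: -x[0],
}
chosen = erm(candidates, xs, ys, u0=0.25)
```

With $u_0 = 0.25$, the candidate based on the first coordinate attains empirical error 1/8 and is selected; the other two candidates make five and six errors out of eight.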


Our first main result is an excess risk bound for the empirical risk minimizer $\hat{s}_n$ over a class $\mathcal{S}$
of uniformly bounded scoring functions. In the following theorem, we consider that the level sets
of scoring functions from the class $\mathcal{S}$ form a Vapnik-Chervonenkis (VC) class of sets.


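Theorem 5 We assume that

(i) the class $\mathcal{S}$ is symmetric (that is, if $s \in \mathcal{S}$ then $\lambda - s \in \mathcal{S}$) and is a VC major class of functions
with VC dimension $V$.

(ii) the family $\mathcal{K} = \{ G_s, H_s : s \in \mathcal{S} \}$ of cdfs satisfies the following property: any $K \in \mathcal{K}$ has left
and right derivatives, denoted by $K'_+$ and $K'_-$, and there exist strictly positive constants $b$, $B$
such that $\forall (K,t) \in \mathcal{K} \times (0,\lambda)$,

\[ b \le |K'_+(t)| \le B \quad \text{and} \quad b \le |K'_-(t)| \le B . \]

For any $\delta > 0$, we have, with probability larger than $1 - \delta$,

\[ L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \le c_1 \sqrt{\frac{V}{n}} + c_2 \sqrt{\frac{\ln(1/\delta)}{n}} \]

for some positive constants $c_1, c_2$.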








The following remarks provide some insights on conditions (i) and (ii) of the theorem. 


Remark 6 (ON THE COMPLEXITY ASSUMPTION) On the terminology of major sets and major
classes, we refer to Dudley (1999). In the proof, we need to control empirical processes indexed by
sets of the form $\{x : s(x) > t\}$ or $\{x : s(x) < t\}$. Condition (i) guarantees that these sets form a VC
class of sets.


Remark 7 (ON THE CHOICE OF THE CLASS $\mathcal{S}$ OF SCORING FUNCTIONS) In order to grasp the
meaning of condition (ii) of the theorem, we consider the one-dimensional case with real-valued
scoring functions. Assume that the distribution of the random variable $X$ has a bounded density $f$
with respect to Lebesgue measure. Assume also that scoring functions $s$ are differentiable except,
possibly, at a finite number of points, and derivatives are denoted by $s'$. Denote by $f_s$ the density of
$s(X)$. Let $t \in (0,\lambda)$ and denote by $x_1, \ldots, x_p$ the real roots of the equation $s(x) = t$. We can express
the density of $s(X)$ thanks to the change-of-variable formula (see, for instance, Papoulis, 1965):

\[ f_s(t) = \frac{f(x_1)}{|s'(x_1)|} + \cdots + \frac{f(x_p)}{|s'(x_p)|} . \]
Figure 1: Typical example of a scoring function.


This shows that the scoring functions should have neither flat nor steep parts. We can take,
for instance, the class $\mathcal{S}$ to be the class of piecewise linear functions with a finite number of local
extrema and with uniformly bounded left and right derivatives: $\forall s \in \mathcal{S}$, $\forall x$, $m \le s'_+(x) \le M$ and $m \le
s'_-(x) \le M$ for some strictly positive constants $m$ and $M$ (see Figure 1). Note that any subinterval
of $[0,\lambda]$ has to be in the range of scoring functions $s$ (if not, some elements of $\mathcal{K}$ will present a
plateau). In fact, the proof requires such a behavior only in the vicinity of the points corresponding
to the quantiles $Q(s, 1-u_0)$ for all $s \in \mathcal{S}$.


PROOF. Set $v_0 = 1 - u_0$. By a standard argument (see, for instance, Devroye et al., 1996), we have:

\[
\begin{aligned}
L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) &\le 2 \sup_{s \in \mathcal{S}} |\hat{L}_n(s) - L(s)| \\
&\le 2 \sup_{s \in \mathcal{S}} |\hat{L}_n(s) - L_n(s)| + 2 \sup_{s \in \mathcal{S}} |L_n(s) - L(s)| .
\end{aligned}
\]

Note that the second term in the bound is an empirical process whose behavior is well-known.
In our case, assumption (i) implies that the class of sets $\{x : s(x) > Q(s,v_0)\}$ indexed by scoring
functions $s$ has a VC dimension smaller than $V$. Hence, we have by a concentration argument
combined with a VC bound for the expectation of the supremum (see, for instance, Lugosi), for any
$\delta > 0$, with probability larger than $1 - \delta$,

\[ \sup_{s \in \mathcal{S}} |L_n(s) - L(s)| \le c \sqrt{\frac{V}{n}} + c' \sqrt{\frac{\ln(1/\delta)}{n}} \]

for universal constants $c, c'$.
The novel part of the analysis lies in the control of the first term and we now show how to handle
it. Following the work of Koul (2002), we set the following notations:

\[ M(s,v) = \mathbb{P}\{Y \cdot (s(X) - Q(s,v)) < 0\} , \]

\[ U_n(s,v) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - Q(s,v)) < 0\} - M(s,v) , \]

and note that $U_n(s,v)$ is centered. In particular, we have:

\[ L_n(s) = U_n(s,v_0) + M(s,v_0) . \]

As $Q(s,v) = F_s^{-1}(v)$, we have $Q(s, F_s \circ \hat{F}_s^{-1}(v)) = \hat{F}_s^{-1}(v) = \hat{Q}(s,v)$ and then

\[ \hat{L}_n(s) = U_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) + M(s, F_s \circ \hat{F}_s^{-1}(v_0)) . \]

Note that $M(s, F_s \circ \hat{F}_s^{-1}(v_0)) = \mathbb{P}\{Y \cdot (s(X) - \hat{Q}(s,v_0)) < 0 \mid D_n\}$ where $D_n$ denotes the sample $(X_1,Y_1), \ldots, (X_n,Y_n)$.
We then have the following decomposition, for any $s \in \mathcal{S}$ and $v_0 \in (0,1)$:

\[ |\hat{L}_n(s) - L_n(s)| \le |U_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) - U_n(s,v_0)| + |M(s, F_s \circ \hat{F}_s^{-1}(v_0)) - M(s,v_0)| . \]

Recall the notation $p = \mathbb{P}\{Y = 1\}$. Since $M(s,v) = (1-p)(1 - H_s \circ F_s^{-1}(v)) + p\, G_s \circ F_s^{-1}(v)$ and
$F_s = pG_s + (1-p)H_s$, the mapping $v \mapsto M(s,v)$ is Lipschitz by assumption (ii). Thus, there exists a
constant $\kappa < \infty$, depending only on $p$, $b$ and $B$, such that:

\[ |M(s, F_s \circ \hat{F}_s^{-1}(v_0)) - M(s,v_0)| \le \kappa\, |F_s \circ \hat{F}_s^{-1}(v_0) - v_0| . \]
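Moreover, we have, for any $s \in \mathcal{S}$:

\[
\begin{aligned}
|F_s \circ \hat{F}_s^{-1}(v_0) - v_0| &\le |F_s \circ \hat{F}_s^{-1}(v_0) - \hat{F}_s \circ \hat{F}_s^{-1}(v_0)| + |\hat{F}_s \circ \hat{F}_s^{-1}(v_0) - v_0| \\
&\le \sup_{t \in (0,\lambda)} |F_s(t) - \hat{F}_s(t)| + \frac{1}{n} .
\end{aligned}
\]

Here again, we can use assumption (i) and a standard VC bound from Lugosi in order to control
the empirical process: with probability larger than $1 - \delta$,

\[ \sup_{(s,t) \in \mathcal{S} \times (0,\lambda)} |F_s(t) - \hat{F}_s(t)| \le c \sqrt{\frac{V}{n}} + c' \sqrt{\frac{\ln(1/\delta)}{n}} \]

for some constants $c, c'$.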


It remains to control the term involving the process $U_n$:

\[ |U_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) - U_n(s,v_0)| \le \sup_{v \in (0,1)} |U_n(s,v) - U_n(s,v_0)| \le 2 \sup_{v \in (0,1)} |U_n(s,v)| . \]

Using that the class of sets of the form $\{x : s(x) > Q(s,v)\}$ for $v \in (0,1)$ is included in the class
of sets of the form $\{x : s(x) > t\}$ where $t \in (0,\lambda)$, we then have

\[ \sup_{v \in (0,1)} |U_n(s,v)| \le \sup_{t \in (0,\lambda)} \left| \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - t) < 0\} - \mathbb{P}\{Y \cdot (s(X) - t) < 0\} \right| \]

which leads again to an empirical process indexed by a VC class of sets and can be bounded as
before. □
2.3 Fast Rates of Convergence


We now propose to examine conditions leading to fast rates of convergence (faster than $n^{-1/2}$). It
has been noticed (see Mammen and Tsybakov, 1999; Tsybakov, 2004; Massart and Nédélec, 2006)
that it is possible to derive such rates of convergence in the classification setup under additional
assumptions on the distribution. We propose here to adapt these assumptions for the problem of
classification with mass constraint.

Our concern here is to formulate the type of conditions which render the problem easier from
a statistical perspective. For this reason and to avoid technical issues, we will consider a quite
restrictive setup where it is assumed that:


• the class $\mathcal{S}$ of scoring functions is a finite class with $N$ elements,


• an optimal scoring rule $s^*$ is contained in $\mathcal{S}$.


We have found that the following additional conditions on the distribution and the class $\mathcal{S}$ allow
us to derive fast rates of convergence for the excess risk in our problem.

(iii) There exist constants $\alpha \in (0,1)$ and $D > 0$ such that, for all $t \ge 0$,

\[ \mathbb{P}\{|\eta(X) - Q(\eta, 1-u_0)| \le t\} \le D\, t^{\frac{\alpha}{1-\alpha}} . \]

(iv) the family $\mathcal{K} = \{ G_s, H_s : s \in \mathcal{S} \}$ of cdfs satisfies the following property: for any $s \in \mathcal{S}$, $G_s$
and $H_s$ are twice differentiable at $Q(s, 1-u_0) = F_s^{-1}(1-u_0)$.


Note that condition (iii) simply extends the standard low noise assumption introduced by
Tsybakov (2004) (see also Boucheron et al., 2005, for an account on this) where the level 1/2 is replaced
by the $(1-u_0)$-quantile of $\eta(X)$. Condition (iv) is a technical requirement needed in order to derive
an approximation of the statistics involved in empirical risk minimization.


Remark 8 (CONSEQUENCE OF CONDITION (III)) We recall here the various equivalent
formulations of condition (iii) as they are described in Section 5.2 of the survey paper by Boucheron et al.
(2005). A slight variation in our setup is due to the presence of the quantile $Q(\eta, 1-u_0)$ but we
can easily adapt the corresponding conditions. Hence, we have, under condition (iii), the variance
control, for any $s \in \mathcal{S}$:

\[ \mathrm{Var}\left( \mathbb{I}\{Y \neq 2\mathbb{I}_{C_s}(X) - 1\} - \mathbb{I}\{Y \neq 2\mathbb{I}_{C^*_{u_0}}(X) - 1\} \right) \le c\, (L(s) - L^*_{u_0})^{\alpha} , \]

or, equivalently,

\[ \mathbb{E}\left( \mathbb{I}_{C_s \Delta C^*_{u_0}}(X) \right) \le c\, (L(s) - L^*_{u_0})^{\alpha} . \]

Recall that $L_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - Q(s, 1-u_0)) < 0\}$. We point out that $L_n(s)$ is not an
empirical criterion since the quantile $Q(s, 1-u_0)$ depends on the distribution. However, we can introduce
the minimizer of this functional:

\[ s_n = \arg\min_{s \in \mathcal{S}} L_n(s) , \]

for which we can use the same argument as in the classification setup. We then have, by a standard
argument based on Bernstein's inequality (which will be provided for completeness in the proof of
Theorem 10 below), with probability $1 - \delta$,

\[ L(s_n) - L^*_{u_0} \le c \left( \frac{\log(N/\delta)}{n} \right)^{\frac{1}{2-\alpha}} \]

for some positive constant $c$. We will show below how to obtain a similar rate when the true quantile
$Q(s, 1-u_0)$ is replaced by the empirical quantile $\hat{Q}(s, 1-u_0)$ in the criterion to be minimized.


We point out that conditions (ii) and (iii) are not completely independent. We offer the following 
proposition which will be useful in the sequel. 


Proposition 9 If $(G_\eta, H_\eta)$ belongs to the class $\mathcal{K}$ fulfilling condition (ii), then $F_\eta$ is Lipschitz and
condition (iii) is satisfied with $\alpha = 1/2$.


PROOF. We recall that $F_\eta = pG_\eta + (1-p)H_\eta$ and assume for simplicity that $G_\eta$ and $H_\eta$ are
differentiable. By condition (ii), we then have $|F'_\eta| = p|G'_\eta| + (1-p)|H'_\eta| \le pB + (1-p)B = B$.
Set $q = Q(\eta, 1-u_0)$. Then, by the mean value theorem, we have, for all
$t > 0$:

\[ \mathbb{P}\{|\eta(X) - q| \le t\} = F_\eta(t+q) - F_\eta(-t+q) \le B\,(t + q - (-t+q)) = 2Bt . \]

We have proved that condition (iii) is fulfilled with $D = 2B$ and $\alpha = 1/2$. □


The novel part of the analysis below lies in the control of the bias induced by plugging the
empirical quantile $\hat{Q}(s, 1-u_0)$ in the risk functional. The next theorem shows that faster rates of
convergence up to the order of $n^{-2/3}$ can be obtained under the previous assumptions.


Theorem 10 We assume that the class $\mathcal{S}$ of scoring functions is a finite class with $N$ elements,
and that it contains an optimal scoring rule $s^*$. Moreover, we assume that conditions (i)-(iv) are
satisfied. We recall that $\hat{s}_n = \arg\min_{s \in \mathcal{S}} \hat{L}_n(s)$. Then, for any $\delta > 0$, we have, with probability $1 - \delta$:

\[ L(\hat{s}_n) - L^*_{u_0} \le c \left( \frac{\log(N/\delta)}{n} \right)^{2/3} \]

for some constant $c$.


Remark 11 (ON THE RATE $n^{-2/3}$) This result highlights the fact that rates faster than the one
obtained in Theorem 5 can be obtained in this setup with additional regularity assumptions. However,
it is noteworthy that the standard low noise assumption (iii) is already contained, by Proposition
9, in assumption (ii) which is required in proving the typical $n^{-1/2}$ rate. The consequence of this
observation is that there is no hope of getting rates up to $n^{-1}$ unless assumption (ii) is weakened.


Remark 12 (ON THE ASSUMPTION $s^* \in \mathcal{S}$) This assumption is not important and can be removed.
For a neat argument, check the proof of Theorem 5 from Clémençon et al. (To appear) which uses a
result by Massart (2006).


The proof of the previous theorem is based on two arguments: the structure of linear signed rank
statistics and the variance control assumption. The situation is similar to the one we encountered
in Clémençon et al. (To appear) where we were dealing with U-statistics and we had to invoke
Hoeffding's decomposition in order to grasp the behavior of the underlying U-processes. Here we
require a similar argument to describe the structure of the empirical risk functional $\hat{L}_n(s)$ under
study. This statistic can be interpreted as a linear signed rank statistic and the key decomposition
has been used in the context of nonparametric hypothesis testing and R-estimation. We mainly
refer to Hájek and Šidák (1967), Dupač and Hájek (1969), Koul (1970), and Koul and Staudte
(1972) for an account on rank statistics.

We now prepare for the proof by stating the main ideas in the next propositions, but first we
need to introduce some notations. Set:

\[ \forall v \in [0,1], \quad K(s,v) = \mathbb{E}\left( Y\, \mathbb{I}\{s(X) \le Q(s,v)\} \right) = p\,G_s(Q(s,v)) - (1-p)\,H_s(Q(s,v)) , \]

\[ R_n(s,v) = \frac{1}{n} \sum_{i=1}^{n} Y_i\, \mathbb{I}\{s(X_i) \le \hat{Q}(s,v)\} . \]

Then we can write:

\[ L(s) = 1 - p + K(s, 1-u_0) , \]

\[ \hat{L}_n(s) = \frac{n_-}{n} + R_n(s, 1-u_0) , \]

where $n_- = \sum_{i=1}^{n} \mathbb{I}\{Y_i = -1\}$.
We note that the statistic $\hat{L}_n(s)$ is related to linear signed rank statistics.


Definition 13 (Linear signed rank statistic) Consider $Z_1, \ldots, Z_n$ an i.i.d. sample with distribution
$F$ and a real-valued score generating function $\Phi$. Denote by $R_i^+ = \mathrm{rank}(|Z_i|)$ the rank of $|Z_i|$ in the
sample $|Z_1|, \ldots, |Z_n|$. Then the statistic

\[ \sum_{i=1}^{n} \Phi\left( \frac{R_i^+}{n+1} \right) \mathrm{sgn}(Z_i) \]

is a linear signed rank statistic.

Proposition 14 For fixed $s$ and $v$, the statistic $R_n(s,v)$ is a linear signed rank statistic.


PROOF. Take $Z_i = Y_i s(X_i)$. The random variables $Z_i$ have their absolute value distributed according
to $F_s$ and have the same sign as $Y_i$. It is easy to see that the statistic $R_n(s,v)$ is a linear signed rank
statistic with score generating function $\Phi(x) = \mathbb{I}_{\{x \le v\}}$. □
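
The rank structure behind Proposition 14 can be made concrete. When the scores are positive and distinct, $s(X_i)$ lies below the empirical quantile of level $v$ exactly when the rank of $|Z_i| = s(X_i)$ is at most $\lceil nv \rceil$. The sketch below compares the two ways of computing the statistic on made-up data; the function names and the rank-threshold formulation are ours.

```python
import math

def rank_form(scores, labels, v):
    """Statistic written with signs of Z_i = Y_i * s(X_i) and ranks of |Z_i|."""
    n = len(scores)
    k = max(1, math.ceil(v * n - 1e-12))  # rank threshold matching the empirical quantile
    z = [y * s for s, y in zip(scores, labels)]
    order = sorted(range(n), key=lambda i: abs(z[i]))
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    return sum((1 if z[i] > 0 else -1) for i in range(n) if rank[i] <= k) / n

def direct_form(scores, labels, v):
    """(1/n) * sum of Y_i over instances with s(X_i) <= empirical v-quantile."""
    n = len(scores)
    k = max(1, math.ceil(v * n - 1e-12))
    q_hat = sorted(scores)[k - 1]
    return sum(y for s, y in zip(scores, labels) if s <= q_hat) / n

# Made-up sample with positive, distinct scores.
scores = [0.8, 0.1, 0.5, 0.3, 0.9, 0.6]
labels = [1, -1, 1, -1, 1, -1]
values = [(rank_form(scores, labels, v), direct_form(scores, labels, v))
          for v in (0.2, 0.5, 0.8, 1.0)]
```

Each pair in `values` contains two equal numbers, which reflects the fact that the statistic depends on the sample only through the signs of the $Z_i$ and the ranks of the $|Z_i|$.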


A decomposition of Hoeffding's type for such statistics can be formulated. Set first:

\[ Z_n(s,v) = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - K'(s,v) \right) \mathbb{I}\{s(X_i) \le Q(s,v)\} - K(s,v) + v K'(s,v) , \]

where $K'(s,v)$ denotes the derivative of the function $v \mapsto K(s,v)$. Note that $Z_n(s,v)$ is a centered
average of i.i.d. terms, each with variance:

\[ \sigma^2(s,v) = v - K(s,v)^2 + v(1-v)\,K'^2(s,v) - 2(1-v)\,K'(s,v)\,K(s,v) . \]
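
The variance expression can be recovered by a short computation. The following is a sketch assuming that $F_s$ is continuous, so that $\mathbb{P}\{s(X) \le Q(s,v)\} = v$; the abbreviations $A = \mathbb{I}\{s(X) \le Q(s,v)\}$, $K = K(s,v)$ and $K' = K'(s,v)$ are ours. Since the constant $-K + vK'$ does not affect the variance, it suffices to consider a single summand $(Y - K')A$, using $Y^2 = 1$ and $\mathbb{E}(YA) = K$:

```latex
\begin{aligned}
\mathbb{E}\big[(Y - K')A\big] &= K - vK', \\
\mathbb{E}\big[(Y - K')^2 A\big] &= \mathbb{E}\big[(1 - 2K'Y + K'^2)A\big] = v - 2K'K + vK'^2, \\
\mathrm{Var}\big((Y - K')A\big) &= v - 2K'K + vK'^2 - (K - vK')^2 \\
&= v - K^2 + v(1-v)K'^2 - 2(1-v)K'K .
\end{aligned}
```

This is the variance of one summand; the average over $n$ i.i.d. terms therefore has variance $\sigma^2(s,v)/n$, in line with the stochastic order $n^{-1/2}$ of the leading term.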
The next result is due to Koul (1970) and we provide an alternate proof in the Appendix. 
Proposition 15 Assume that condition (iv) holds. We have, for all $s \in \mathcal{S}$ and $v \in [0,1]$:

\[ R_n(s,v) = K(s,v) + Z_n(s,v) + \Lambda_n(s) , \]

with

\[ \Lambda_n(s) = O_{\mathbb{P}}(n^{-1}) \quad \text{as } n \to \infty . \]


This asymptotic expansion highlights the structure of the statistic $\hat{L}_n(s)$ for fixed $s$:

\[ \hat{L}_n(s) = \frac{n_-}{n} + K(s, 1-u_0) + Z_n(s, 1-u_0) + \Lambda_n(s) . \]

Once centered, the leading term $Z_n(s, 1-u_0)$ is an empirical average of i.i.d. random variables (of a
stochastic order of $n^{-1/2}$) and the remainder term $\Lambda_n(s)$ is of a stochastic order of $n^{-1}$. The nature of
the decomposition of $\hat{L}_n(s)$ is certainly unexpected because the leading term contains an additional
derivative term given by $K'(s, 1-u_0)\,(v - \mathbb{I}\{s(X_i) \le Q(s, 1-u_0)\})$. The revelation of this fact is one
of the major contributions in the work of Koul (2002).

Now, in order to establish consistency and rates-of-convergence-type results, we need to
focus only on the leading term which carries most of the statistical information, while the remainder
needs to be controlled uniformly over the candidate class $\mathcal{S}$. As a consequence, the variance
control assumption will only concern the variance of the kernel $h_s$ involved in the empirical average
$Z_n(s, 1-u_0)$ and defined as follows:

\[ h_s(X_i, Y_i) = \left( Y_i - K'(s,v) \right) \mathbb{I}\{s(X_i) \le Q(s,v)\} - K(s,v) + v K'(s,v) . \]

We then have

\[ Z_n(s,v) - Z_n(s^*,v) = \frac{1}{n} \sum_{i=1}^{n} \left( h_s(X_i, Y_i) - h_{s^*}(X_i, Y_i) \right) . \]

Proposition 16 Fix $v \in [0,1]$. Assume that condition (iii) holds. Then, we have, for all $s \in \mathcal{S}$:

\[ \mathrm{Var}\left( h_s(X_i, Y_i) - h_{s^*}(X_i, Y_i) \right) \le c\, (L(s) - L(s^*))^{\alpha} , \]

for some constant $c$.


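PROOF. We first write that:

\[ h_s(X_i, Y_i) - h_{s^*}(X_i, Y_i) = \mathrm{I} + \mathrm{II} + \mathrm{III} + \mathrm{IV} + \mathrm{V} \]

where

\[
\begin{aligned}
\mathrm{I} &= Y_i \left( \mathbb{I}\{s(X_i) \le Q(s,v)\} - \mathbb{I}\{s^*(X_i) \le Q(s^*,v)\} \right), \\
\mathrm{II} &= \left( K'(s^*,v) - K'(s,v) \right) \mathbb{I}\{s^*(X_i) \le Q(s^*,v)\}, \\
\mathrm{III} &= K'(s,v) \left( \mathbb{I}\{s^*(X_i) \le Q(s^*,v)\} - \mathbb{I}\{s(X_i) \le Q(s,v)\} \right), \\
\mathrm{IV} &= K(s^*,v) - K(s,v), \\
\mathrm{V} &= v \left( K'(s,v) - K'(s^*,v) \right) .
\end{aligned}
\]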
By the Cauchy-Schwarz inequality, we only need to show that the expected value of the square of
these quantities is smaller than $(L(s) - L^*)^{\alpha}$ up to some multiplicative constant.
Note that, by definition of $K$, we have:

\[
\begin{aligned}
\mathrm{II} &= \left( L'(s^*,v) - L'(s,v) \right) \mathbb{I}\{s^*(X_i) \le Q(s^*,v)\}, \\
\mathrm{IV} &= L(s^*) - L(s), \\
\mathrm{V} &= v \left( L'(s,v) - L'(s^*,v) \right)
\end{aligned}
\]

where $L'(s,v)$ denotes the derivative of the function $v \mapsto L(s,v)$. It is clear that, for any $s$,
$L(s,v) = L(s^*,v)$ implies that $L'(s,v) = L'(s^*,v)$, otherwise $s^*$ would not be an optimal scoring
function at some level $v'$ in the vicinity of $v$. Therefore, since $\mathcal{S}$ is finite, there exists a constant $c$
such that

\[ \left( L'(s,v) - L'(s^*,v) \right)^2 \le c\, (L(s) - L^*)^{\alpha} \]

and then $\mathbb{E}(\mathrm{II}^2)$ and $\mathbb{E}(\mathrm{V}^2)$ are bounded accordingly.
Moreover, we have: 












































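\[ \mathbb{E}(\mathrm{I}^2) \le \mathbb{E}\left( \mathbb{I}_{C_s \Delta C_{s^*}}(X) \right) \le c\, (L(s) - L(s^*))^{\alpha} \]

for some positive constant $c$, by assumption (iii).
Finally, by assumption (ii), we have that $K'(s,v)$ is uniformly bounded and thus, the term
$\mathbb{E}(\mathrm{III}^2)$ can be handled similarly. □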














Proof of Theorem 10. Set $v_0 = 1 - u_0$. First notice that $\hat{s}_n = \arg\min_{s \in \mathcal{S}} R_n(s, 1-u_0)$. We then have

\[
\begin{aligned}
L(\hat{s}_n) - L(s^*) &= K(\hat{s}_n, v_0) - K(s^*, v_0) \\
&\le R_n(s^*, v_0) - R_n(\hat{s}_n, v_0) - \left( K(s^*, v_0) - K(\hat{s}_n, v_0) \right) \\
&\le Z_n(s^*, v_0) - Z_n(\hat{s}_n, v_0) + 2 \sup_{s \in \mathcal{S}} |\Lambda_n(s)|
\end{aligned}
\]

where we used the decomposition of the linear signed rank statistic from Proposition 15 to obtain
the last inequality.

By Proposition 15, we know that the second term on the right hand side is of stochastic order $n^{-1}$
since the class $\mathcal{S}$ is of finite cardinality. It remains to control the leading term $Z_n(s^*, v_0) - Z_n(\hat{s}_n, v_0)$.
At this point, we will use the same argument as in Section 5.2 from Boucheron et al. (2005).

Denote by C = sup, xy |As(x,y)| and by 07(s) = Var (hs (Xi, Yi) — hs (X; Y;)). By Bernstein’s in- 
equality for averages of upper bounded and centered random variables (see Devroye et al., 1996) 
and the union bound, we have, with probability 1 — ð, for all s € S: 


1 





20?(s)log(N/ò) n 2Clog(N/ò) 


n 3n 





Zn(s*, vo) — Zn (s, Vo) < y 
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< 


| Cones log(N/8) , 2Clog(N/8) 
n i 3n 
thanks to the variance control obtained in Proposition 16. Since this inequality holds for any s, it 


holds in particular for s = $». Therefore, we have obtained the following result, with probability 


1— Ô: 





2c(L(S;,) —L*)* log(N/8) | 2c'log(N/8) 


n 3n 





L(S;,) —L(s") < y 


for some constants c, c’. At the cost of increasing the multiplicative constant factor, we can get rid 
of the second term and solve the inequality in the quantity L(5;,) — L(s*) to get 





mee =a 


n 


L(,) —Lls*) < e( 


for some constant c. To end the proof, we plug the value of & = 1/2 following from Proposition 9. 
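
To make the last step explicit, here is a sketch of how the inequality is solved. The shorthands $x = L(\hat{s}_n) - L(s^*)$ and $a = \log(N/\delta)/n$ are introduced here for illustration only, and the lower-order second term is ignored:

$$x \le \sqrt{2c\,x^{\alpha}\,a} \;\Longrightarrow\; x^{1-\alpha/2} \le \sqrt{2c\,a} \;\Longrightarrow\; x \le (2c\,a)^{\frac{1}{2-\alpha}}.$$

With $\alpha = 1/2$ the exponent is $1/(2 - 1/2) = 2/3$, so the bound is of order $(\log(N/\delta)/n)^{2/3}$.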


3. Performance Measures for Local Ranking 


Our main interest here is to develop a setup describing the problem of not only finding but also 
ranking the best instances. In the sequel, we build on the results from Section 2 and also on our 
previous work on the (global) ranking problem (Clémençon et al., To appear) in order to capture 
some of the features of the local ranking problem. The present section is devoted to the construction 
of performance measures reflecting the quality of ranking rules on a restricted set of instances. 


3.1 ROC Curves and Optimality in the Local Ranking Problem 


We consider the same statistical model as before with (X,Y) being a pair of random variables over $\mathcal{X} \times \{-1,+1\}$ and we examine ranking rules resulting from real-valued scoring functions $s : \mathcal{X} \to (0,\lambda)$. The reference tool for assessing the performance of a scoring function s in separating the two populations (positive vs. negative labels) is the Receiver Operating Characteristic, known as the ROC curve (van Trees, 1968; Egan, 1975). If we take the notations $\bar{G}_s(z) = P\{s(X) > z \mid Y = 1\}$ (true positive rate) and $\bar{H}_s(z) = P\{s(X) > z \mid Y = -1\}$ (false positive rate), with $G_s = 1 - \bar{G}_s$ and $H_s = 1 - \bar{H}_s$ the corresponding conditional cdfs, we can define the ROC curve, for any scoring function s, as the plot of the function:

$$z \mapsto (\bar{H}_s(z), \bar{G}_s(z))$$

for thresholds $z \in (0,\lambda)$, or equivalently as the plot of the function:

$$t \mapsto \bar{G}_s \circ H_s^{-1}(1-t)$$

for $t \in (0,1)$. The optimal scoring function is the one whose ROC curve dominates all the others for all $z \in (0,\lambda)$ (or $t \in (0,1)$) and such a function actually exists. Indeed, by recalling the hypothesis testing framework in the classification model (see Remark 2) and using the Neyman-Pearson lemma, it is easy to check that the ROC curve of the function $\eta(x) = P\{Y = 1 \mid X = x\}$ dominates the ROC curve of any other scoring function. We point out that the ROC curve of a scoring function s is invariant to strictly increasing transformations of s.

In our approach, for a given scoring function s, we focus on thresholds z corresponding to the cut-off separating a proportion $u \in (0,1)$ of top scored instances according to s from the rest. Recall from Section 2 that the best instances according to s are the elements of the set $C_{s,u} = \{x \in \mathcal{X} \mid s(x) > Q(s,1-u)\}$ where $Q(s,1-u)$ is the $(1-u)$-quantile of $s(X)$. We set the following notations:

$$\alpha(s,u) = P\{s(X) > Q(s,1-u) \mid Y = -1\} = \bar{H}_s \circ F_s^{-1}(1-u),$$
$$\beta(s,u) = P\{s(X) > Q(s,1-u) \mid Y = +1\} = \bar{G}_s \circ F_s^{-1}(1-u),$$

where $\bar{H}_s(z) = P\{s(X) > z \mid Y = -1\}$ and $\bar{G}_s(z) = P\{s(X) > z \mid Y = +1\}$.


We propose to re-parameterize the ROC curve with the proportion $u \in (0,1)$ and then describe it as the plot of the function:

$$u \mapsto (\alpha(s,u), \beta(s,u)),$$

for each scoring function s. When focusing on the best instances at rate $u_0$, we only consider the part of the ROC curve for values $u \in (0,u_0)$.
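
As an illustration of the re-parameterization, the following is a minimal sketch of plug-in estimates of $\alpha(s,u)$ and $\beta(s,u)$ from a labeled sample. The function name `empirical_alpha_beta`, the simulated Gaussian scores and the use of an order statistic as empirical quantile are assumptions made for this example, not constructions taken from the paper.

```python
import numpy as np

def empirical_alpha_beta(scores, labels, u):
    """Plug-in estimates of alpha(s,u) and beta(s,u) (illustrative helper).

    scores: values s(X_i); labels: +1/-1; u: proportion of top instances.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n = scores.size
    k = max(1, int(round(u * n)))        # number of top-scored instances kept
    threshold = np.sort(scores)[n - k]   # empirical cut-off (order statistic)
    top = scores >= threshold            # empirical version of the set C_{s,u}
    alpha = top[labels == -1].mean()     # fraction of negatives in the top set
    beta = top[labels == 1].mean()       # fraction of positives in the top set
    return alpha, beta

# Simulated data (assumption): positives score higher on average.
rng = np.random.default_rng(0)
labels = np.where(rng.random(1000) < 0.3, 1, -1)
scores = rng.normal(loc=labels.astype(float), scale=1.0)

alpha, beta = empirical_alpha_beta(scores, labels, u=0.2)
p_hat = np.mean(labels == 1)
mass = p_hat * beta + (1 - p_hat) * alpha   # empirical mass of the top set
```

In this sketch `mass` equals the retained proportion k/n by construction, which is a convenient sanity check on the two estimates.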

However attractive the ROC curve is as a graphical tool, it is not a practical one for developing learning procedures based on straightforward optimization. The most natural approach is to consider risk functionals built after the ROC curve such as the Area Under an ROC Curve (known as the AUC or AROC, see Hanley and McNeil, 1982). Our goals in this section are:


1. to extend the AUC criterion in order to focus on restricted parts of the ROC curve,
2. to describe the optimal elements with respect to this extended criterion.


We point out that extending the AUC is not trivial. In order to focus on the best instances, a natural idea is to truncate the AUC (as in the approach of Dodd and Pepe, 2003).


Definition 17 (Partial AUC) We define the partial AUC for a scoring function s and a rate $u_0$ of best instances as:

$$\mathrm{PartAUC}(s,u_0) = \int_0^{\alpha(s,u_0)} \beta(s,\alpha)\, d\alpha.$$

We conjecture that such a criterion is not appropriate for local ranking. If it were, then we should have: $\forall s$, $\mathrm{PartAUC}(s,u_0) \le \mathrm{PartAUC}(\eta,u_0)$, since the function $\eta$ would provide the optimal ranking. However, there is strong evidence that this is not true, as shown by a simple geometric argument which we describe below.
In order to represent the partial AUC of a scoring function s, we need to locate the cut-off point given the constraint on the rate $u_0$ of best instances. We notice that $\alpha(s,u)$ and $\beta(s,u)$ are related by a linear relation, for fixed u and p, when s varies:

$$u = p\,\beta(s,u) + (1-p)\,\alpha(s,u)$$


where $p = P\{Y = 1\}$. We denote the line plot of this relation by $D(u,p)$ and call it the control line when $u = u_0$. Hence, the part of the ROC curve of a scoring function s corresponding to the best instances at rate $u_0$ is the part going from the origin (0,0) to the intersection with the control line $D(u_0,p)$. The partial AUC is then the area under this part of the ROC curve (it corresponds to the shaded area in the left display of Figure 2).

Figure 2: ROC curves, control line $D(u_0,p)$ and partial AUC at rate $u_0$ of best instances.


The optimality of $\eta$ with respect to the partial AUC can then be questioned. Indeed, the closer to $\eta$ the scoring function s is, the higher the ROC curve is, but at the same time the integration domain shrinks (right display of Figure 2) so that the overall impact on the integral is not clear. Let us now put things formally in the following lemma.
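
Before the formal statement, the linear relation behind the control line can be recovered in one line. This is a sketch under the assumption, made here for illustration, that $s(X)$ has a continuous distribution so that the set $C_{s,u}$ has mass exactly u:

$$u = P\{X \in C_{s,u}\} = P\{Y = 1\}\,P\{X \in C_{s,u} \mid Y = 1\} + P\{Y = -1\}\,P\{X \in C_{s,u} \mid Y = -1\} = p\,\beta(s,u) + (1-p)\,\alpha(s,u).$$

Solving for $\beta$ gives $\beta = (u - (1-p)\alpha)/p$, a line of slope $-(1-p)/p$ in the $(\alpha,\beta)$ plane. Since this line is decreasing while the ROC curve is increasing, raising the ROC curve moves its intersection with $D(u_0,p)$ to the left, which is the shrinking of the integration domain described above.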


Lemma 18 For any scoring function s, we have for all $u \in (0,1)$,

$$\beta(s,u) \le \beta(\eta,u),$$
$$\alpha(s,u) \ge \alpha(\eta,u).$$

Moreover, we have equality only for those s such that $C_{s,u} = C^*_u$.


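PROOF. We show the first inequality. By definition, we have:

$$\beta(s,u) = 1 - G_s(Q(s,1-u)),$$

where $G_s$ is the cdf of $s(X)$ conditionally on $Y = 1$. Observe that, for any scoring function s,

$$p\,(1 - G_s(Q(s,1-u))) = P\{Y = 1,\ s(X) > Q(s,1-u)\} = E\big(\eta(X)\,\mathbb{I}\{X \in C_{s,u}\}\big).$$

We thus have

$$\begin{aligned}
p\,\big(G_s(Q(s,1-u)) - G_\eta(Q(\eta,1-u))\big) &= E\big(\eta(X)\,(\mathbb{I}\{X \in C^*_u\} - \mathbb{I}\{X \in C_{s,u}\})\big) \\
&= E\big(\eta(X)\,\mathbb{I}\{X \notin C^*_u\}\,(\mathbb{I}\{X \in C^*_u\} - \mathbb{I}\{X \in C_{s,u}\})\big) \\
&\quad + E\big(\eta(X)\,\mathbb{I}\{X \in C^*_u\}\,(\mathbb{I}\{X \in C^*_u\} - \mathbb{I}\{X \in C_{s,u}\})\big) \\
&\ge -E\big(Q(\eta,1-u)\,\mathbb{I}\{X \notin C^*_u\}\,\mathbb{I}\{X \in C_{s,u}\}\big) \\
&\quad + E\big(Q(\eta,1-u)\,\mathbb{I}\{X \in C^*_u\}\,(1 - \mathbb{I}\{X \in C_{s,u}\})\big) \\
&= Q(\eta,1-u)\,(1 - u - 1 + u) = 0.
\end{aligned}$$

The second inequality simply follows from the identity below:

$$1 - u = p\,G_s(Q(s,1-u)) + (1-p)\,H_s(Q(s,1-u)).$$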


The previous lemma will be important when describing the optimal rules for local ranking criteria. But, at this point, we still do not have a satisfactory criterion for the problem of ranking the best instances. Before considering different heuristics for extending the AUC criterion in the next subsections, we will proceed backwards and define our target, that is to say, the optimal scoring functions for our problem.


Definition 19 (Class $S^*$ of optimal scoring functions) The optimal scoring functions for ranking the best instances at the rate $u_0$ are defined as the members of the equivalence class (functions defined up to composition with a nondecreasing transformation) of scoring functions $s^*$ such that:

$$s^*(x) \begin{cases} = \eta(x) & \text{if } x \in C^*_{u_0}, \\ < \inf_{z \in C^*_{u_0}} \eta(z) & \text{if } x \notin C^*_{u_0}. \end{cases}$$

Such scoring functions fulfill the two properties of locating the best instances (indeed $C_{s^*,u_0} = C^*_{u_0}$) and ranking them as well as the regression function.

In the light of Lemma 18, we will see that a wide collection of criteria with the set $S^*$ as the set of optimal elements could naturally be considered, depending on how one wants to weight the two types of error $1 - \beta(s,u)$ (type II error in the hypothesis testing framework) and $\alpha(s,u)$ (type I error) according to the rate $u \in [0,u_0]$. However, not all the criteria obtained in this manner can be interpreted as generalizations of the AUC criterion for $u_0 = 1$.


3.2 Generalization of the AUC Criterion 


In Clémençon et al. (To appear), we have considered the ranking error of a scoring function s as defined by:

$$R(s) = P\{(Y - Y')(s(X) - s(X')) < 0\},$$

where $(X',Y')$ is an i.i.d. copy of the random pair $(X,Y)$.

Interestingly, it can be proved that minimizing the ranking error $R(s)$ is equivalent to maximizing the well-known AUC criterion. This is immediate once we write down the probabilistic interpretation of the AUC:

$$\mathrm{AUC}(s) = P\{s(X) > s(X') \mid Y = 1,\ Y' = -1\} = 1 - \frac{1}{2p(1-p)}\,R(s).$$

We now propose a local version of the ranking error on a measurable set $C \subset \mathcal{X}$:

$$R(s,C) = P\{(s(X) - s(X'))(Y - Y') < 0,\ (X,X') \in C^2\}.$$

On sets of the form $C = C_{s,u} = \{x \in \mathcal{X} \mid s(x) > Q(s,1-u)\}$ with mass equal to u, the local ranking error will be denoted by $R(s,u) = R(s,C_{s,u})$.
We will also consider the local analogue of the AUC criterion:

$$\mathrm{LocAUC}(s,u) = P\{s(X) > s(X'),\ s(X) > Q(s,1-u) \mid Y = 1,\ Y' = -1\}.$$


This criterion obviously boils down to the standard criterion for u = 1. However, in the case where u < 1, we will see that there is no equivalence between maximizing the LocAUC criterion and minimizing the local ranking error $s \mapsto R(s,u)$. Indeed, the local ranking error is not a relevant performance measure for finding the best instances. Minimizing it would solve the problem of finding the instances that are the easiest to rank.

The following theorem states that optimal scoring functions $s^*$ in the set $S^*$ maximize the LocAUC criterion and that the latter may be decomposed as the sum of a 'power' term and (the opposite of) a local ranking error term.


Theorem 20 Let $u_0 \in (0,1)$. We have, for any scoring function s:

$$\forall s^* \in S^*, \quad \mathrm{LocAUC}(s,u_0) \le \mathrm{LocAUC}(s^*,u_0).$$

Moreover, the following relation holds:

$$\forall s, \quad \mathrm{LocAUC}(s,u_0) = \beta(s,u_0) - \frac{1}{2p(1-p)}\,R(s,u_0),$$

where $R(s,u_0) = R(s,C_{s,u_0})$.


PROOF. We first introduce the notation for the Lebesgue-Stieltjes integral. Whenever $\varphi$ is a cdf on $\mathbb{R}$ and $\psi$ is integrable, the integral $\int \psi(z)\, d\varphi(z)$ denotes the Lebesgue-Stieltjes integral (integration with respect to the measure $\nu$ defined by $\nu[a,b) = \varphi(b) - \varphi(a)$ for any real numbers $a < b$). If $\varphi$ has a density with respect to the Lebesgue measure, then the integral can be written as a Lebesgue integral: $\int \psi(z)\, d\varphi(z) = \int \psi(z)\varphi'(z)\, dz$. We shall use this convention repeatedly in the sequel. In particular, if Z is a random variable with cdf $F_Z$, then we can write: $E(Z) = \int z\, dF_Z(z)$.
Now set $v_0 = 1 - u_0$. Observe first that, by conditioning on X, we have:

$$\begin{aligned}
\mathrm{LocAUC}(s,u_0) &= E\big(\mathbb{I}\{s(X) > s(X')\}\,\mathbb{I}\{s(X) > Q(s,v_0)\} \mid Y = 1,\ Y' = -1\big) \\
&= E\big(\mathbb{I}\{s(X) > Q(s,v_0)\}\, E\big(\mathbb{I}\{s(X) > s(X')\} \mid Y' = -1,\ X\big) \mid Y = 1\big) \\
&= E\big(H_s(s(X))\,\mathbb{I}\{s(X) > Q(s,v_0)\} \mid Y = 1\big) \\
&= \int_{Q(s,v_0)}^{+\infty} H_s(z)\, dG_s(z).
\end{aligned}$$


The last equality is obtained by using the fact that, conditionally on Y = 1, the random variable $s(X)$ has cdf $G_s$. We now use that $pG_s = F_s - (1-p)H_s$ and we obtain:

$$p\,\mathrm{LocAUC}(s,u_0) = \int_{Q(s,v_0)}^{+\infty} H_s(z)\, dF_s(z) - (1-p)\int_{Q(s,v_0)}^{+\infty} H_s(z)\, dH_s(z).$$

Recall now that $\alpha(s,v) = (1 - H_s) \circ F_s^{-1}(1-v)$ and make the change of variable $1 - v = F_s(z)$:

$$\int_{Q(s,v_0)}^{+\infty} H_s(z)\, dF_s(z) = \int_0^{u_0} (1 - \alpha(s,v))\, dv.$$

The second term is computed by making the change of variable $a = H_s(z)$, which leads to:

$$\int_{Q(s,v_0)}^{+\infty} H_s(z)\, dH_s(z) = \int_{1-\alpha(s,u_0)}^{1} a\, da.$$

We have obtained:

$$p\,\mathrm{LocAUC}(s,u_0) = \int_0^{u_0} (1 - \alpha(s,v))\, dv - \frac{1-p}{2}\Big(1 - (1 - \alpha(s,u_0))^2\Big).$$


From Lemma 18, we have that, for any $u \in (0,1)$, the functional $s \mapsto \alpha(s,u)$ is minimized for $s = \eta$. Hence, the first part of Theorem 20 is established.
Besides, integrating by parts, we get:

$$\int_{Q(s,v_0)}^{+\infty} H_s(z)\, dG_s(z) = \beta(s,u_0)\,(1 - \alpha(s,u_0)) + \int_{Q(s,v_0)}^{+\infty} (1 - G_s(z))\, dH_s(z).$$

The same change of variables as before leads to:

$$\int_{Q(s,v_0)}^{+\infty} (1 - G_s(z))\, dH_s(z) = \int_0^{\alpha(s,u_0)} \beta(s,\alpha)\, d\alpha.$$

We then have another expression of $\mathrm{LocAUC}(s,u_0)$:

$$\mathrm{LocAUC}(s,u_0) = \int_0^{\alpha(s,u_0)} \beta(s,\alpha)\, d\alpha + \beta(s,u_0)\,(1 - \alpha(s,u_0)).$$


We develop further by expressing the product of $\alpha$ and $\beta$ in terms of probability. Using the independence of $(X,Y)$ and $(X',Y')$, we obtain:

$$\begin{aligned}
\alpha(s,u_0)\,\beta(s,u_0) &= \frac{1}{p(1-p)}\,P\{s(X) \wedge s(X') > Q(s,v_0),\ Y = 1,\ Y' = -1\} \\
&= P\{s(X) > s(X'),\ s(X) \wedge s(X') > Q(s,v_0) \mid Y = 1,\ Y' = -1\} \\
&\quad + \frac{1}{p(1-p)}\,P\{s(X) < s(X'),\ (X,X') \in C_{s,u_0}^2,\ Y = 1,\ Y' = -1\} \\
&= \int_0^{\alpha(s,u_0)} \beta(s,\alpha)\, d\alpha + \frac{1}{2p(1-p)}\,R(s,u_0).
\end{aligned}$$

Combining this with the previous formula leads to the second statement of the theorem. □

Remark 21 (TRUNCATING THE AUC) In the theorem, we obviously recover the relation between the standard AUC criterion and the (global) ranking error when $u_0 = 1$. Besides, by checking the proof, one may relate the generalized AUC criterion to the partial AUC. As a matter of fact, we have:

$$\forall s, \quad \mathrm{LocAUC}(s,u_0) = \mathrm{PartAUC}(s,u_0) + \beta(s,u_0) - \alpha(s,u_0)\,\beta(s,u_0).$$

The values $\alpha(s,u_0)$ and $\beta(s,u_0)$ are the coordinates of the intersecting point between the ROC curve of the scoring function s and the control line $D(u_0,p)$. The theorem reveals that evaluating the local performance of a scoring statistic $s(X)$ by the truncated AUC as proposed in Dodd and Pepe (2003) is highly questionable since the maximizer of the functional $s \mapsto \mathrm{PartAUC}(s,u_0)$ is usually not in $S^*$.
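
The decomposition of Theorem 20 and the relation of Remark 21 can be checked numerically on pair counts. The sketch below is illustrative only: the helper `local_quantities`, the simulated data and the empirical analogues (pair frequencies in place of probabilities, an order statistic in place of the quantile) are assumptions of this example. In particular, `discordant` stands for the empirical analogue of $R(s,u_0)/(2p(1-p))$.

```python
import numpy as np

def local_quantities(scores, labels, u0):
    """Empirical alpha, beta, LocAUC, PartAUC and discordant-pair rate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n = scores.size
    k = max(1, int(round(u0 * n)))
    threshold = np.sort(scores)[n - k]
    top = scores >= threshold                  # empirical top set
    pos, neg = labels == 1, labels == -1
    pairs = pos.sum() * neg.sum()              # number of (positive, negative) pairs
    # wins[i, j] is True when positive i is scored above negative j
    wins = scores[pos][:, None] > scores[neg][None, :]
    top_pos = top[pos][:, None]
    top_neg = top[neg][None, :]
    return {
        "alpha": top[neg].mean(),
        "beta": top[pos].mean(),
        "loc_auc": (wins & top_pos).sum() / pairs,
        "part_auc": (wins & top_pos & top_neg).sum() / pairs,
        "discordant": (~wins & top_pos & top_neg).sum() / pairs,
    }

# Simulated data (assumption): positives score higher on average.
rng = np.random.default_rng(1)
labels = np.where(rng.random(500) < 0.4, 1, -1)
scores = rng.normal(loc=0.8 * labels, scale=1.0)
q = local_quantities(scores, labels, u0=0.2)

# Theorem 20: LocAUC = beta - R / (2p(1-p))
lhs_thm20, rhs_thm20 = q["loc_auc"], q["beta"] - q["discordant"]
# Remark 21: LocAUC = PartAUC + beta - alpha * beta
rhs_rem21 = q["part_auc"] + q["beta"] - q["alpha"] * q["beta"]
```

Both identities hold exactly for these pair counts, which makes them a useful unit test when implementing the criteria.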


3.3 Generalized Wilcoxon Statistic 


We now propose a different extension of the plain AUC criterion. Consider $(X_1,Y_1), \dots, (X_n,Y_n)$, n i.i.d. copies of the random pair $(X,Y)$. The intuition relies on a well-known relationship between Mann-Whitney and Wilcoxon statistics. Indeed, a natural empirical estimate of the AUC is the rate of concordant pairs:

$$\widehat{\mathrm{AUC}}(s) = \frac{1}{n_+ n_-}\sum_{1 \le i,j \le n} \mathbb{I}\{Y_i = -1,\ Y_j = 1,\ s(X_i) < s(X_j)\},$$

with $n_+ = n - n_- = \sum_{i=1}^{n} \mathbb{I}\{Y_i = +1\}$.
It will be useful to have in mind the definition of a linear rank statistic.


Definition 22 (linear rank statistic) Consider $Z_1, \dots, Z_n$ an i.i.d. sample with distribution F and a real-valued score generating function $\Phi$. Denote by $R_i = \mathrm{rank}(Z_i)$ the rank of $Z_i$ in the sample $Z_1, \dots, Z_n$. Then, for weights $c_1, \dots, c_n$, the statistic

$$\sum_{i=1}^{n} c_i\,\Phi\!\left(\frac{R_i}{n+1}\right)$$

is a linear rank statistic.


We refer to Hajek and Sidák (1967) and van de Vaart (1998) for basic results related to linear rank statistics. In particular, we recall that, for fixed s, the Wilcoxon statistic $T_n(s)$ is a linear rank statistic for the sample $s(X_1), \dots, s(X_n)$, with random weights $c_i = \mathbb{I}\{Y_i = 1\}$ and score generating function $\Phi(v) = v$:

$$T_n(s) = \sum_{i=1}^{n} \mathbb{I}\{Y_i = 1\}\,\frac{\mathrm{rank}(s(X_i))}{n+1},$$

where $\mathrm{rank}(s(X_i))$ denotes the rank of $s(X_i)$ in the sample $\{s(X_j),\ 1 \le j \le n\}$. The following relation is well-known:

$$\frac{n_+ n_-}{n+1}\,\widehat{\mathrm{AUC}}(s) + \frac{n_+(n_+ + 1)}{2(n+1)} = T_n(s).$$


Moreover, the statistic $T_n(s)/n_+$ is an asymptotically normal estimate of

$$W(s) = E\big(F_s(s(X)) \mid Y = 1\big).$$

Note that the theoretical counterpart of the previous relation may be written as

$$W(s) = (1-p)\,\mathrm{AUC}(s) + p/2.$$

Now, in order to take into account a proportion $u_0$ of the highest ranks only, we introduce the following quantity:


Definition 23 (W-ranking performance measure) Consider the criterion related to the score generating function $\Phi_{u_0}(v) = v\,\mathbb{I}\{v > 1 - u_0\}$:

$$W(s,u_0) = E\big(\Phi_{u_0}(F_s(s(X))) \mid Y = 1\big).$$

It will be called the W-ranking performance measure at rate $u_0$.


Note that the empirical counterpart of $W(s,u_0)$ is given by $T_n(s,u_0)/n_+$, with

$$T_n(s,u_0) = \sum_{i=1}^{n} \mathbb{I}\{Y_i = 1\}\,\Phi_{u_0}\!\left(\frac{\mathrm{rank}(s(X_i))}{n+1}\right).$$

Using the results from the previous subsection, we can easily check that the following theorem 
holds. 


Theorem 24 We have, for all s:

$$\forall s^* \in S^*, \quad W(s,u_0) \le W(s^*,u_0).$$

Furthermore, we have:

$$W(s,u_0) = \frac{p}{2}\,\beta(s,u_0)\,(2 - \beta(s,u_0)) + (1-p)\,\mathrm{LocAUC}(s,u_0).$$


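PROOF. We start from the definition of W:

$$W(s,u_0) = E\big(F_s(s(X))\,\mathbb{I}\{F_s(s(X)) > 1 - u_0\} \mid Y = 1\big) = \int_{Q(s,1-u_0)}^{+\infty} F_s(z)\, dG_s(z).$$

We recall that $F_s = pG_s + (1-p)H_s$, which leads to:

$$W(s,u_0) = p\int_{Q(s,1-u_0)}^{+\infty} G_s(z)\, dG_s(z) + (1-p)\int_{Q(s,1-u_0)}^{+\infty} H_s(z)\, dG_s(z).$$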


The second term corresponds exactly to the LocAUC. The first term is easily computed by the change of variable $b = G_s(z)$:

$$\int_{Q(s,1-u_0)}^{+\infty} G_s(z)\, dG_s(z) = \int_{1-\beta(s,u_0)}^{1} b\, db.$$

Elementary computations lead to the formula in the theorem. Moreover, the mapping $t \mapsto t(2-t)$ being nondecreasing for $t \in (0,1)$, we have, from Lemma 18:

$$\forall s^* \in S^*, \quad \beta(s,u_0)\,(2 - \beta(s,u_0)) \le \beta(s^*,u_0)\,(2 - \beta(s^*,u_0)).$$

We also use the optimality of $s^*$ for LocAUC established in Theorem 20 to conclude the proof. □

Remark 25 (EVIDENCE AGAINST 'TWO-STEP' STRATEGIES) It is noteworthy that not all combinations of $\beta(s,u_0)$ (or $\alpha(s,u_0)$) and $R(s,u_0)$ lead to a criterion with $S^*$ being the set of optimal scoring functions. We have provided two non-trivial examples for which this is the case (Theorems 20 and 24). But, in general, this remark should discourage 'naive' two-step strategies for solving the local ranking problem. By 'naive' two-step strategies, we refer here to stagewise strategies which would, first, compute an estimate $\hat{C}$ of the set containing the best instances, and then, solve the ranking problem over $\hat{C}$ as described in Clémençon et al. (To appear). However, this idea combined with a certain amount of iterativeness might be the key to the design of efficient algorithms. In any case, we stress here the importance of making use of a global criterion, synthesizing our double goal: finding and ranking the best instances.


Remark 26 (OTHER RANKING PERFORMANCE MEASURES) The ideas expressed above suggest that several ranking criteria can be proposed. For instance, one can consider maximization of other linear rank statistics with particular score generating functions $\Phi$ and there are many possible choices which would emphasize the importance of the highest ranks. One of these choices is $\Phi(v) = v^p$, which corresponds to the p-norm push proposed by Rudin (2006), although the definition of the ranks in her work is slightly different. The Discounted Cumulative Gain criterion, studied in particular by Cossock and Zhang (2006) and Li et al. (2007), is of a different nature and cannot be represented in a similar way. Other extensions can be proposed in the spirit of the tail strength measure from Taylor and Tibshirani (2006). The theoretical study of such criteria is still at an early stage, especially for the last proposal. We also point out that with such extensions, probabilistic interpretations and the explicit connection to the AUC criterion seem to be lost.


4. Empirical Risk Minimization of the Local AUC Criterion 


In the previous section, we have seen that there are various performance measures which can be 
considered for the problem of ranking the best instances. In order to perform the statistical analysis, 
we will favor the representations of LOCAUC and W which involve the classification error L(s, uo) 
and the local ranking error R(s, uo). By combining Theorems 20 and 24, we can easily get: 


$$2p(1-p)\,\mathrm{LocAUC}(s,u_0) = (1-p)(p+u_0) - (1-p)\,L(s,u_0) - R(s,u_0)$$

and

$$2p\,W(s,u_0) = C(p,u_0) + \left(\frac{p+u_0}{2} - 1\right) L(s,u_0) - \frac{1}{4}\,L^2(s,u_0) - R(s,u_0),$$

where $C(p,u_0)$ is a constant depending only on p and $u_0$.
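
The two displays can be recovered as follows. This is a sketch that assumes the identity $L(s,u_0) = p + u_0 - 2p\,\beta(s,u_0)$, which is what the first display implies when compared with Theorem 20 (the definition of L belongs to Section 2 and is not restated here). Writing $\beta$, $L$ and $R$ for $\beta(s,u_0)$, $L(s,u_0)$ and $R(s,u_0)$:

$$2p(1-p)\,\mathrm{LocAUC}(s,u_0) = 2p(1-p)\,\beta - R = (1-p)(p + u_0 - L) - R,$$

$$2p\,W(s,u_0) = p^2\beta(2-\beta) + 2p(1-p)\,\mathrm{LocAUC}(s,u_0) = (p + u_0 - L) - \tfrac{1}{4}(p + u_0 - L)^2 - R.$$

Expanding the square gives the coefficient $\frac{p+u_0}{2} - 1$ in front of L and, under this reading, $C(p,u_0) = (p+u_0) - \frac{1}{4}(p+u_0)^2$.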
We exploit the first expression and choose to study the minimization of the following criterion 
for ranking the best instances: 


$$M(s) = M(s,u_0) = R(s,u_0) + (1-p)\,L(s,u_0).$$

It is obvious that the elements of $S^*$ are the optimal elements of the functional $M(\cdot\,,u_0)$ and we will now consider scoring functions obtained through empirical risk minimization of this criterion.

More precisely, given n i.i.d. copies $(X_1,Y_1), \dots, (X_n,Y_n)$ of $(X,Y)$, we introduce the empirical counterpart:

$$\hat{M}_n(s) = \hat{M}_n(s,u_0) = \hat{R}_n(s) + \frac{n_-}{n}\,\hat{L}_n(s),$$

with $n_- = \sum_{i=1}^{n} \mathbb{I}\{Y_i = -1\}$ and

$$\hat{R}_n(s) = \frac{1}{n(n-1)}\sum_{i \ne j} \mathbb{I}\{(s(X_i) - s(X_j))(Y_i - Y_j) < 0,\ s(X_i) \wedge s(X_j) \ge \hat{Q}(s,1-u_0)\}.$$

Note that $\hat{R}_n(s)$ is expected to be close to the U-statistic of degree two

$$R_n(s) = \frac{1}{n(n-1)}\sum_{i \ne j} k_s((X_i,Y_i),(X_j,Y_j)),$$

with symmetric kernel

$$k_s((x,y),(x',y')) = \mathbb{I}\{(s(x) - s(x'))(y - y') < 0,\ s(x) \wedge s(x') > Q(s,1-u_0)\}.$$

The statistic $R_n(s)$ corresponds to an unbiased estimate of the local ranking error $R(s,u_0)$. The next result provides a standard error bound for the excess risk of the empirical risk minimizer over a class S of scoring functions:

$$\hat{s}_n = \arg\min_{s \in S} \hat{M}_n(s).$$
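
For a finite class S, the minimization can be carried out by direct enumeration. The sketch below is a hedged illustration rather than the authors' procedure: `empirical_M` and `erm` are hypothetical names, the form used for $\hat{L}_n(s)$ (error of the rule predicting +1 on the empirical top set) is an assumption based on the classification problem with mass constraint, and the data and the three candidate scoring functions are simulated.

```python
import numpy as np

def empirical_M(scores, labels, u0):
    """Empirical criterion R_hat + (n_minus / n) * L_hat (illustrative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n = scores.size
    k = max(1, int(round(u0 * n)))
    q_hat = np.sort(scores)[n - k]          # empirical quantile of the scores
    top = scores >= q_hat
    # Assumed form of L_hat: error of "predict +1 on the top set, -1 elsewhere"
    L_hat = np.mean(np.where(top, 1, -1) != labels)
    # Local ranking error: discordant ordered pairs with both points in the top set
    s, y = scores[top], labels[top]
    ds = s[:, None] - s[None, :]
    dy = y[:, None] - y[None, :]
    R_hat = np.sum(ds * dy < 0) / (n * (n - 1))
    n_minus = np.sum(labels == -1)
    return R_hat + (n_minus / n) * L_hat

def erm(candidates, X, labels, u0):
    """Index of the candidate scoring function minimizing the criterion."""
    values = [empirical_M(s(X), labels, u0) for s in candidates]
    return int(np.argmin(values))

# Simulated data (assumption): the label depends on the first coordinate only.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
prob = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
labels = np.where(rng.random(n) < prob, 1, -1)

candidates = [
    lambda X: X[:, 0],     # informative score
    lambda X: -X[:, 0],    # reversed score
    lambda X: X[:, 1],     # pure-noise score
]
best = erm(candidates, X, labels, u0=0.2)
```

On this simulated sample the informative score is selected. Note that the pairwise computation is quadratic in the size of the top set, which is acceptable here because only a fraction $u_0$ of the sample is involved.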
Proposition 27 Assume that conditions (i)-(ii) of Theorem 2 are fulfilled. Then, there exist constants $c_1$ and $c_2$ such that, for any $\delta > 0$, we have:

$$M(\hat{s}_n) - \inf_{s \in S} M(s) \le c_1\sqrt{\frac{V}{n}} + c_2\sqrt{\frac{\log(1/\delta)}{n}}$$

with probability larger than $1 - \delta$.





PROOF. (SKETCH) The proof combines the argument used in the proof of Theorem 5 with the techniques used in establishing Proposition 2 in Clémençon et al. (2005). We have:

$$\begin{aligned}
M(\hat{s}_n) - \inf_{s \in S} M(s) \le{} & 2\Big(\sup_{s \in S} |\hat{R}_n(s) - R_n(s)| + \sup_{s \in S} |R(s) - R_n(s)|\Big) \\
& + 2(1-p)\Big(\sup_{s \in S} |\hat{L}_n(s) - L_n(s)| + \sup_{s \in S} |L(s) - L_n(s)|\Big) + 2\,\Big|\frac{n_+}{n} - p\Big|.
\end{aligned}$$

The middle term may be bounded by applying the result stated in Theorem 5, while the last one can be handled by using Bernstein's exponential inequality for an average of Bernoulli random variables. By combining Lemma 1 in Clémençon et al. (2005) with the Chernoff method, we can deal with the U-process term $\sup_{s \in S} |R(s) - R_n(s)|$. Finally, the term $\sup_{s \in S} |\hat{R}_n(s) - R_n(s)|$ can also be controlled by repeating the argument in the proof of Theorem 5. The only difference here is that we have to consider the U-process term


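$$\sup_{s \in S,\ t}\ \Big|\frac{1}{n(n-1)}\sum_{i \ne j} K_{s,t}((X_i,Y_i),(X_j,Y_j)) - E\big(K_{s,t}((X,Y),(X',Y'))\big)\Big|$$

with

$$K_{s,t}((x,y),(x',y')) = \mathbb{I}\{(s(x) - s(x'))(y - y') > 0,\ s(x) \wedge s(x') > t\}.$$

For deriving first-order results with such a process, we refer to the same type of argument as used in Clémençon et al. (2005). □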
Remark 28 (ABOUT THE POSSIBILITY OF DERIVING FAST RATES) By checking the proof sketch, it turns out that sharper bounds may be achieved for the U-process term. Indeed, it is a simple variation of our previous work in Clémençon et al. (2005) where we have used Hoeffding's decomposition in order to grasp the deep structure of the underlying statistic. Here we will need, in addition, condition (iii) to hold for all $u \in (0,u_0]$. Indeed, if we localize our low-noise assumption from Clémençon et al. (2005), it takes the following form: there exist constants $\alpha \in (0,1)$ and $B > 0$ such that, for all $t > 0$, we have

$$\forall x \in C^*_{u_0}, \quad P\{|\eta(X) - \eta(x)| \le t\} \le B\,t^{\frac{\alpha}{1-\alpha}}.$$

It is easy to see that this is equivalent to condition (iii) for all $u \in (0,u_0]$: there exist constants $\alpha \in (0,1)$ and $B > 0$ such that, for all $t > 0$, we have

$$\forall u \in (0,u_0], \quad P\{|\eta(X) - Q(\eta,1-u)| \le t\} \le B\,t^{\frac{\alpha}{1-\alpha}}.$$

However, in the present formulation where p is assumed to be unknown, it looks like this improvement will be spoiled by the 'proportion term' which will still be of the order $O(n^{-1/2})$.


Remark 29 (ABOUT THE EXTENSION TO CONVEX RISK MINIMIZATION) An important topic in classification theory is convex risk minimization. Understanding the connection between classification error and its convex surrogates has made it possible to explain the behavior of practical algorithms such as boosting and SVM from a statistical perspective (see Boucheron et al., 2005 for an account of this aspect and Bartlett et al., 2006 for state-of-the-art results). A natural question which arises here is whether the consistency results on local ranking can be extended in this spirit. Note that, if we do not focus on best instances and consider the whole AUC as a performance criterion, it is straightforward to obtain consistency and universal rates of convergence for convex risk minimization (as explained in Clémençon et al., To appear). In the case of local ranking as we introduced it, this extension is less straightforward since the decision rule, represented here by the scoring function s, also enters the criterion through the empirical quantile $\hat{Q}(s,v)$. We refer to Rudin (2006), Cossock and Zhang (2006) and Li et al. (2007) where convex risk minimization strategies in the context of ranking are discussed.


5. Conclusion 


In the present work, we have presented theoretical results on local ranking. In the first part of the paper (Section 2), we considered a subproblem that we called the classification with mass constraint problem. The aim was to establish and study an empirical risk minimization strategy for only finding, and not ranking, the best instances. In this case, one attempts to minimize the classification error over classifiers that contain a fixed proportion u of observations. This constraint leads to empirical risk functionals which involve an empirical quantile indexed by the class of candidate scoring functions and can be seen as linear signed rank statistics. We then provide a consistency result and discuss the noise assumptions required to derive fast rates of convergence in this setup. These assumptions require a limited regularity of the underlying distributions which prevents the fast rate from dropping below the order of $n^{-2/3}$. The second part of the paper (Section 3) is dedicated to the introduction of new performance measures for local ranking related to the ROC curve and the AUC criterion. We show that the AUC can be extended in several ways (partial AUC and local AUC) but not all these extensions are tailored for the local ranking problem. In particular, the naive extension known as the partial AUC is not appropriate and requires a correction term. We also introduce the optimal scoring functions which should be considered as the target of any local ranking method. We also discuss other extensions based on Wilcoxon statistics, the W-ranking performance measure, for which optimal rules can also be recovered. In the last section of the paper (Section 4), the problem of ranking the best instances is studied from a statistical perspective. A consistency result is provided for empirical risk minimization of the W-ranking performance measure.

Appendix A.


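In this section, we provide the proof of Proposition 15.
PROOF. First, for all $(s,v) \in S \times (0,1)$ set

$$V_n(s,v) = \frac{1}{n}\sum_{i=1}^{n} Y_i\,\mathbb{I}\{s(X_i) \le Q(s,v)\} - K(s,v).$$

We have the following decomposition:

$$\forall v \in [0,1], \quad \hat{K}_n(s,v) - K(s,v) = V_n(s, F_s \circ \hat{F}_s^{-1}(v)) + K(s, F_s \circ \hat{F}_s^{-1}(v)) - K(s,v).$$

We shall first prove that

$$V_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) = V_n(s,v_0) + O_P(n^{-1}).$$

We denote by $A(s,\varepsilon)$ the event $\{|F_s \circ \hat{F}_s^{-1}(v_0) - v_0| < \varepsilon\}$. On the event $A(s,\varepsilon)$, we have:

$$|V_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) - V_n(s,v_0)| \le \sup_{v\,:\,|v-v_0| \le \varepsilon} |V_n(s,v) - V_n(s,v_0)|.$$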

We bound the right-hand side for fixed $\varepsilon$, by making use of an argument from van de Geer (2000). First, we need to put things into the right format. Set:

$$V_n(s,v) - V_n(s,v_0) = \frac{1}{n}\sum_{i=1}^{n} \big(u_i(s,v) - u_i(s,v_0)\big),$$

where $u_i(s,v) = Y_i\,\mathbb{I}\{s(X_i) \le Q(s,v)\} - E\big(Y\,\mathbb{I}\{s(X) \le Q(s,v)\}\big)$ for $s \in S$ and $v \in (0,1)$. We observe that

$$|u_i(s,v) - u_i(s,v_0)| \le d_i(v,v_0),$$

where

$$d_i(v,v_0) = \mathbb{I}\{s(X_i) \in [Q(s,v \wedge v_0),\ Q(s,v \vee v_0)]\} + |v - v_0|.$$

Denote by

$$d(v,v_0) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\{s(X_i) \in [Q(s,v \wedge v_0),\ Q(s,v \vee v_0)]\} + |v - v_0|$$

a distance over $\mathbb{R}$. Set also:

$$R(\varepsilon) = \sup_{v\,:\,|v-v_0| \le \varepsilon} d(v,v_0),$$

and observe that

$$R(\varepsilon) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\{s(X_i) \in [Q(s,v_0-\varepsilon),\ Q(s,v_0+\varepsilon)]\} + \varepsilon.$$
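We then have, by applying Lemma 8.5 from van de Geer (2000), for $nt^2/R^2(\varepsilon)$ sufficiently large,

$$P\Big\{\sup_{v\,:\,|v-v_0| \le \varepsilon} |V_n(s,v) - V_n(s,v_0)| \ge t \ \Big|\ X_1, \dots, X_n\Big\} \le C\exp\Big\{-\frac{c\,nt^2}{R^2(\varepsilon)}\Big\},$$

for some positive constants c and C. It remains to integrate out and, for this purpose, we introduce the event:

$$\forall x > 0, \quad A(x) = \{3\varepsilon - x < R(\varepsilon) < 3\varepsilon + x\}.$$

We then have:

$$E\Big(\exp\Big\{-\frac{c\,nt^2}{R^2(\varepsilon)}\Big\}\Big) \le \exp\Big\{-\frac{c\,nt^2}{(3\varepsilon + x)^2}\Big\} + P\{A(x)^c\}.$$

Now, we have, by Bernstein's inequality:

$$P\{A(x)^c\} = P\Big\{\Big|\frac{1}{n}B(n,2\varepsilon) - 2\varepsilon\Big| \ge x\Big\} \le 2\exp\Big\{-\frac{nx^2}{2(2\varepsilon + x/3)}\Big\},$$

where we have used the notation $B(n,2\varepsilon)$ for a binomial $(n,2\varepsilon)$ random variable. We can take $x = O(t/\sqrt{\varepsilon})$ and assume also $x = o(\varepsilon)$ to get, for $nt^2/\varepsilon^2$ large enough,

$$P\Big\{\sup_{v\,:\,|v-v_0| \le \varepsilon} |V_n(s,v) - V_n(s,v_0)| > t\Big\} \le C\exp\Big\{-\frac{c\,nt^2}{\varepsilon^2}\Big\},$$
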




for some positive constants c and C. This can be reformulated by writing that the following bound holds, with probability larger than $1 - \delta/2$,

$$\sup_{v\,:\,|v-v_0| \le \varepsilon} |V_n(s,v) - V_n(s,v_0)| \le \varepsilon\sqrt{\frac{\log(2C/\delta)}{nc}}.$$

We recall that, by the triangle inequality and the Dvoretzky-Kiefer-Wolfowitz theorem, if we take $\varepsilon = c\sqrt{\log(2/\delta)/n}$, we have $P\{A(s,\varepsilon)\} \ge 1 - \delta/2$. It follows that, with probability larger than $1 - \delta$, we have, for some constant $\kappa$:

$$|V_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) - V_n(s,v_0)| \le \kappa\,\frac{\log(2C/\delta)}{n}$$

for any $s \in S$. It remains to deal with the second term $K(s, F_s \circ \hat{F}_s^{-1}(v_0)) - K(s,v_0)$. By the differentiability assumption (iv), we have: $\forall s \in S$,

$$\sup_{|v-v_0| \le \delta} \{K(s,v) - K(s,v_0) - (v - v_0)K'(s,v_0)\} = O(\delta^2), \quad \text{as } \delta \to 0.$$

Since $|F_s \circ \hat{F}_s^{-1}(v_0) - v_0| = O_P(n^{-1/2})$, we get that

$$K(s, F_s \circ \hat{F}_s^{-1}(v_0)) - K(s,v_0) = K'(s,v_0)\,(F_s \circ \hat{F}_s^{-1}(v_0) - v_0) + O_P(n^{-1}), \quad \text{as } n \to \infty.$$

Moreover, as

$$F_s \circ \hat{F}_s^{-1}(v_0) - v_0 = -(\hat{F}_s \circ F_s^{-1}(v_0) - v_0) + O_P(n^{-1}),$$

we finally obtain that

$$K(s, F_s \circ \hat{F}_s^{-1}(v_0)) - K(s,v_0) = -K'(s,v_0)\,(\hat{F}_s \circ F_s^{-1}(v_0) - v_0) + O_P(n^{-1}).$$